Although various methods have been proposed for multi-label classification, most approaches still follow the feature learning mechanism of single-label (multi-class) classification, namely, learning one shared image feature to classify multiple labels. However, we find that this One-shared-Feature-for-Multiple-Labels (OFML) mechanism is not conducive to learning discriminative label features and makes the model non-robust. For the first time, we mathematically prove that the weakness of the OFML mechanism is that the optimal learned image feature cannot maintain high similarities with multiple classifiers simultaneously when minimizing the cross-entropy loss. To address the limitations of the OFML mechanism, we introduce the One-specific-Feature-for-One-Label (OFOL) mechanism and propose a novel disentangled label feature learning (DLFL) framework to learn a disentangled representation for each label. The core of the framework is a feature disentanglement module, which contains learnable semantic queries and a Semantic Spatial Cross-Attention (SSCA) module. Specifically, the learnable semantic queries maintain semantic consistency between different images of the same label. The SSCA module localizes the label-related spatial regions and aggregates the located region features into the corresponding label feature to achieve feature disentanglement. We achieve state-of-the-art performance on eight datasets across three tasks, i.e., multi-label classification, pedestrian attribute recognition, and continual multi-label learning.
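The aggregation step described in this abstract (label queries attending over spatial locations and pooling the located region features into one feature per label) can be illustrated with a minimal sketch. It assumes single-head scaled dot-product attention with no learned projections; the function name `label_features` and the toy inputs are hypothetical and are not taken from the paper, whose SSCA module may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_features(queries, spatial_feats):
    """Return one aggregated feature per label query.

    queries:       L label queries, each a list of d floats
    spatial_feats: N spatial-location features, each a list of d floats
    Assumption: plain dot-product attention stands in for the SSCA module.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # Attention scores pick out the locations relevant to this label.
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(d)
                  for f in spatial_feats]
        weights = softmax(scores)
        # The weighted sum pools the located region features.
        feat = [sum(w * f[k] for w, f in zip(weights, spatial_feats))
                for k in range(d)]
        out.append(feat)
    return out


# Toy example: two spatial locations and two label queries (made-up values).
feats = [[4.0, 0.0], [0.0, 4.0]]
queries = [[4.0, 0.0], [0.0, 4.0]]
per_label = label_features(queries, feats)
```

Each query ends up with its own feature dominated by the location it matches, which is the "one specific feature per label" idea in miniature, as opposed to a single pooled feature shared by all classifiers.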
translated by 谷歌翻译
In this paper, we extend previous self-supervised approaches for language identification by experimenting with a Conformer-based architecture in a multilingual pre-training paradigm. We find that pre-trained speech models optimally encode language-discriminative information in lower layers. Further, we demonstrate that the embeddings obtained from these layers are robust enough to classify unseen languages and to handle different acoustic environments without additional training. After fine-tuning a pre-trained Conformer model on the VoxLingua107 dataset, we achieve results similar to current state-of-the-art systems for language identification. Moreover, our model accomplishes this with 5x fewer parameters. We open-source the model through the NVIDIA NeMo toolkit.
This paper proposes a modification to RNN-Transducer (RNN-T) models for automatic speech recognition (ASR). In standard RNN-T, the emission of a blank symbol consumes exactly one input frame; in our proposed method, we introduce additional blank symbols, which consume two or more input frames when emitted. We refer to the added symbols as big blanks, and to the method as multi-blank RNN-T. For training multi-blank RNN-Ts, we propose a novel logit under-normalization method in order to prioritize emissions of big blanks. With experiments on multiple languages and datasets, we show that multi-blank RNN-T methods can bring relative speedups of over +90% and +139% to model inference for the English Librispeech and German Multilingual Librispeech datasets, respectively. The multi-blank RNN-T method also improves ASR accuracy consistently. We will release our implementation of the method in the NeMo toolkit.
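How big blanks shorten inference can be sketched with a toy greedy decoder. This is an illustration under stated assumptions, not the authors' implementation: `step_fn` is a hypothetical stand-in for the joint network's argmax, `make_oracle` fabricates an idealized model that always picks the largest safe blank, and the token names `<b>`, `<b2>`, `<b4>` are invented for the example.

```python
def greedy_decode(step_fn, num_frames, blank_durations, max_symbols=10):
    """Greedy transducer decoding with standard and big blanks.

    step_fn(t, hyp) -> token: hypothetical stand-in for the model's argmax
        at frame t given the hypothesis so far.
    blank_durations: maps each blank token to the number of frames it
        consumes (1 for the standard blank, 2 or more for big blanks).
    Returns the hypothesis and the number of model calls made.
    """
    hyp, t, steps = [], 0, 0
    while t < num_frames:
        emitted = 0
        while emitted < max_symbols:
            token = step_fn(t, hyp)
            steps += 1
            if token in blank_durations:
                t += blank_durations[token]  # a big blank skips several frames
                break
            hyp.append(token)
            emitted += 1
        else:
            t += 1  # force progress if one frame emits too many symbols
    return hyp, steps


def make_oracle(label_frames, blank_durations):
    """Toy model: emits each label at its frame, else the largest safe blank."""
    def step_fn(t, hyp):
        due = [tok for f, tok in sorted(label_frames.items()) if f <= t]
        if len(hyp) < len(due):
            return due[len(hyp)]
        upcoming = [f for f in label_frames if f > t]
        gap = min(upcoming) - t if upcoming else float("inf")
        safe = [b for b, d in blank_durations.items() if d <= gap]
        return max(safe, key=lambda b: blank_durations[b])
    return step_fn


labels = {0: "a", 4: "b"}  # made-up alignment over 8 frames
standard = {"<b>": 1}
multi = {"<b>": 1, "<b2>": 2, "<b4>": 4}

hyp_std, steps_std = greedy_decode(make_oracle(labels, standard), 8, standard)
hyp_mb, steps_mb = greedy_decode(make_oracle(labels, multi), 8, multi)
```

In this toy setting both runs produce the same hypothesis, but the multi-blank run needs fewer model calls because each big blank advances the frame index by more than one. A real model will not choose blanks this cleanly.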
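Video object detection has long been an important yet challenging topic in computer vision. Traditional methods mainly focus on designing image-level or box-level feature propagation strategies to exploit temporal information. This paper argues that, with a more effective and efficient feature propagation framework, video object detectors can improve in both accuracy and speed. To this end, this paper studies object-level feature propagation and proposes an object query propagation (QueryProp) framework for high-performance video object detection. The proposed QueryProp contains two propagation strategies: 1) query propagation is performed from sparse key frames to dense non-key frames to reduce the redundant computation on non-key frames; 2) query propagation is performed from previous key frames to the current key frame to improve feature representation through temporal context modeling. To further facilitate query propagation, an adaptive propagation gate is designed to achieve flexible key frame selection. We conduct extensive experiments on the ImageNet VID dataset. QueryProp achieves accuracy comparable to state-of-the-art methods and strikes a good accuracy/speed trade-off. Code is available at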
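Recently, Multi-modal Named Entity Recognition (MNER) has attracted a lot of attention. Most of the work utilizes image information through region-level visual representations obtained from a pretrained object detector and relies on an attention mechanism to model the interactions between image and text representations. However, it is difficult to model such interactions, because image and text representations are trained separately on data from their respective modalities and are not aligned in the same space. As text representations play the most important role in MNER, in this paper we propose Image-text Alignments (ITA) to align image features into the textual space, so that the attention mechanism in transformer-based pretrained textual embeddings can be better utilized. ITA first aligns the image, locally and globally, into regional object tags and image-level captions as visual contexts, concatenates them with the input text as a new cross-modal input, and then feeds it into a pretrained textual embedding model. This makes it easier for the attention module of the pretrained textual embedding model to model the interaction between the two modalities, since they are both represented in the textual space. ITA further aligns the output distributions predicted from the cross-modal input view and the textual input view, so that the MNER model can be more practical and more robust to noise from images. In our experiments, we show that ITA models can achieve state-of-the-art accuracy on multi-modal Named Entity Recognition datasets, even without image information.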
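Influenced by the great success of deep learning via cloud computing and the rapid development of edge chips, research in artificial intelligence (AI) has shifted to both computing paradigms, i.e., cloud computing and edge computing. In recent years, we have witnessed significant progress in developing more advanced AI models on cloud servers that surpass traditional deep learning models, owing to model innovations (e.g., Transformers, pretrained model families), the explosion of training data, and soaring computing capabilities. However, edge computing, and especially edge-cloud collaborative computing, is still in its infancy, because resource-constrained IoT scenarios allow only very limited algorithms to be deployed. In this survey, we conduct a systematic review of both cloud and edge AI. Specifically, we are the first to set up the collaborative learning mechanism for cloud and edge modeling, with a thorough review of the architectures that enable such a mechanism. We also discuss the potential of, and practical experience with, some ongoing advanced edge AI topics, including pretrained models, graph neural networks, and reinforcement learning. Finally, we discuss promising directions and challenges in this field.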
Designing accurate and efficient ConvNets for mobile devices is challenging because the design space is combinatorially large. Due to this, previous neural architecture search (NAS) methods are computationally expensive. ConvNet architecture optimality depends on factors such as input resolution and target devices. However, existing approaches are too resource demanding for case-by-case redesigns. Also, previous work focuses primarily on reducing FLOPs, but FLOP count does not always reflect actual latency. To address these issues, we propose a differentiable neural architecture search (DNAS) framework that uses gradient-based methods to optimize ConvNet architectures, avoiding enumerating and training individual architectures separately as in previous methods. FBNets (Facebook-Berkeley-Nets), a family of models discovered by DNAS, surpass state-of-the-art models both designed manually and generated automatically. FBNet-B achieves 74.1% top-1 accuracy on ImageNet with 295M FLOPs and 23.1 ms latency on a Samsung S8 phone, 2.4x smaller and 1.5x faster than MobileNetV2-1.3 [17] with similar accuracy. Despite higher accuracy and lower latency than MnasNet [20], we estimate FBNet-B's search cost is 420x smaller than MnasNet's, at only 216 GPU hours. Searched for different resolutions and channel sizes, FBNets achieve 1.5% to 6.4% higher accuracy than MobileNetV2. The smallest FBNet achieves 50.2% accuracy and 2.9 ms latency (345 frames per second) on a Samsung S8. Over a Samsung-optimized FBNet, the iPhone-X-optimized model achieves a 1.4x speedup on an iPhone X. FBNet models are open-sourced at https://github.com/facebookresearch/mobile-vision. * Work done while interning at Facebook.
Figure 1. Differentiable neural architecture search (DNAS) for ConvNet design. DNAS explores a layer-wise space in which each layer of a ConvNet can choose a different block. The search space is represented by a stochastic super net. The search process trains the stochastic super net using SGD to optimize the architecture distribution. Optimal architectures are sampled from the trained distribution. The latency of each operator is measured on target devices and used to compute the loss for the super net.
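The latency-aware part of the search described in the Figure 1 caption can be sketched as an expected latency under a per-layer architecture distribution. This is a minimal illustration, assuming a softmax distribution over candidate blocks and a per-operator latency lookup table; the function names and latency values are hypothetical, and the abstract does not give the exact form in which latency enters the loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_latency(arch_logits, latency_table):
    """Expected network latency under the architecture distribution.

    arch_logits:   one list of logits per layer, one logit per candidate block
    latency_table: measured latency (ms) of each candidate block per layer
    Assumption: layer latencies add up, so the expectation is a sum of
    per-layer expectations and is differentiable in the logits.
    """
    total = 0.0
    for logits, latencies in zip(arch_logits, latency_table):
        probs = softmax(logits)
        total += sum(p * lat for p, lat in zip(probs, latencies))
    return total


def most_likely_architecture(arch_logits):
    """Pick the highest-probability block per layer (a stand-in for sampling)."""
    return [max(range(len(l)), key=l.__getitem__) for l in arch_logits]


# Toy search space: 2 layers, 2 candidate blocks each (made-up latencies in ms).
latency_ms = [[1.0, 3.0], [2.0, 4.0]]
uniform = [[0.0, 0.0], [0.0, 0.0]]
prefers_fast = [[2.0, 0.0], [2.0, 0.0]]

lat_uniform = expected_latency(uniform, latency_ms)
lat_fast = expected_latency(prefers_fast, latency_ms)
chosen = most_likely_architecture(prefers_fast)
```

Because the expected latency is a smooth function of the logits, it can be combined with the task loss and minimized by gradient descent, which is what lets measured on-device latency steer the search rather than FLOP count.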
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple, while its performance is impressive. CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even if the LiDAR input is missing. Code will be released at
A recent study has shown a phenomenon called neural collapse in that the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial to discrimination for the minor classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method can bring significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks 1st and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard. Code will be available at
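The simplex equiangular tight frame mentioned in this abstract can be made concrete with a small construction. The sketch below uses the standard formula m_i = sqrt(K/(K-1)) (e_i - (1/K) 1) in K dimensions (an assumption about the embedding dimension made for simplicity; the function names are hypothetical) and checks the two defining properties: unit norms and a common pairwise cosine of -1/(K-1).

```python
import math


def simplex_etf(num_classes):
    """Rows are the K vertices of a simplex ETF embedded in K dimensions."""
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


etf = simplex_etf(4)
norms = [math.sqrt(dot(v, v)) for v in etf]
pair_cosines = [cosine(etf[i], etf[j])
                for i in range(4) for j in range(i + 1, 4)]
```

All vectors have the same length and every pair meets at the same angle, which is the equiangular, maximally separated structure that the abstract says is broken by contextual correlation and class imbalance in semantic segmentation.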